Context
Coffee roasting is the process of turning green coffee beans into brown ones. Roasting can be carried out in a variety of ways, each of which influences the flavor of the end product. A roasting instrument is essentially a convection oven: it applies heat energy to the raw beans to make them consumable.
The price of coffee is heavily influenced by the quality of the beans after roasting, so the cost can be set based on that quality.
Rising automation in the manufacturing industry calls for automating the quality inspection of output products with minimal human intervention. Today, quality inspectors examine products after manufacture to ensure they meet industry standards.
Manual quality inspection of each product is time-consuming, and a low-quality product wastes upstream factory capacity, consumables, labor, and capital. With the rise of AI, companies are looking to leverage machine learning to automate material quality inspection during manufacturing, reducing human intervention while achieving human-level or better accuracy.
Objective
A roasting corporation named "KC Roasters" has engaged you to predict the quality of its roasting instrument's output, which will be used to determine the price of the coffee beans. The quality value ranges from 0 (worst) to 100 (best); the higher the quality of the beans, the higher the price.
The coffee roasting instrument used by KC Roasters is divided into five equal-sized compartments, each fitted with three temperature sensors placed at three different locations, so that temperature can be captured throughout the chamber.
Additionally, the height of the raw material (the volume entering the chamber) and the relative humidity of the roasted material are provided.
The data shared consist of 17 predictor variables and a continuous target variable. The aim is to build a regression model that accurately predicts product quality, which the company can then use to price the beans effectively.
Data Description
Roasters.csv - the dataset consists of the temperature values of the different chambers of the roasting instrument, as collected by the sensors.
Data Dictionary:
1. Problem Definition
Business Context:
Coffee roasting is a key step in producing high-quality coffee beans. The roasting process impacts flavor, aroma, and overall coffee quality.
A company, KC Roasters, wants to automate quality assessment of roasted coffee beans to optimize pricing and reduce waste.
The goal is to build a regression model that predicts the quality of roasted coffee beans based on temperature readings from sensors, raw material height, and roasted coffee humidity.
Higher quality → Higher pricing for coffee beans.
Technical Context:
Exploratory Data Analysis - observations on the shape of the data, the data types of the attributes, missing values, and the statistical summary.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
# 1. Mount Google Drive
drive.mount('/content/drive')
# 2. Loading Dataset from Google Drive
# Defining file path
file_path = '/content/drive/MyDrive/Colab_Notebooks/Featurization_Model_Selection_Tuning/Data.csv'
# Loading dataset
df = pd.read_csv(file_path)
# Displaying first few rows
df.head()
# =======================
# 3. Displaying Dataset Shape
# =======================
print("Dataset Shape:", df.shape) # Printing the number of rows and columns
# =======================
# 4. Checking Data Types
# =======================
print("\nData Types:")
print(df.dtypes) # Displaying the data types of each column
# =======================
# 5. Checking for Missing Values
# =======================
missing_values = df.isnull().sum() # Calculating missing values per column
print("\nMissing Values:")
print(missing_values[missing_values > 0]) # Printing columns with missing values
# Filling missing values with the column mean to prevent data loss
# (numeric_only=True avoids a TypeError if any non-numeric column is present)
df = df.fillna(df.mean(numeric_only=True))
# =======================
# 6. Displaying Statistical Summary of the Data
# =======================
print("\nStatistical Summary:")
print(df.describe()) # Generating summary statistics for numerical columns
# =======================
# 7. Creating a Correlation Heatmap
# =======================
plt.figure(figsize=(12, 8)) # Setting figure size
sns.heatmap(df.corr(), annot=False, cmap='coolwarm', linewidths=0.5) # Plotting heatmap to show feature correlation
plt.title("Feature Correlation Heatmap") # Adding title to the heatmap
plt.show() # Displaying the heatmap
Temperature sensors within the same chamber (e.g., T_data_1_1, T_data_1_2, T_data_1_3) show a high correlation (close to 1.0), indicating redundancy.
This suggests that some of these features can be aggregated (e.g., mean temperature per chamber) to reduce multicollinearity in modeling.
The quality score does not exhibit strong correlation with any individual temperature sensor.
This suggests that temperature alone might not be a direct determinant of coffee quality, and other factors (e.g., humidity, raw material height) may play a more significant role.
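The redundancy among same-chamber sensors can be quantified with variance inflation factors (VIFs). The sketch below uses synthetic data standing in for one chamber's sensors (the real values come from Roasters.csv), and computes VIFs as the diagonal of the inverse correlation matrix, which avoids any extra dependency:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one chamber's sensor data: three readings that
# track a shared base temperature, plus an independent humidity column.
rng = np.random.default_rng(42)
n = 500
base = rng.normal(250, 10, n)
X = pd.DataFrame({
    'T_data_1_1': base + rng.normal(0, 1, n),
    'T_data_1_2': base + rng.normal(0, 1, n),
    'T_data_1_3': base + rng.normal(0, 1, n),
    'AH_data': rng.normal(60, 5, n),
})

# For standardized predictors, VIFs are the diagonal of the inverse correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vif.round(1))  # correlated sensors show very large VIFs; AH_data stays near 1
```

VIFs well above 5-10 for the sensor columns support aggregating them into a per-chamber mean.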
# ===============================================================
# 8. Plotting Distribution of Quality Score - UNIVARIATE ANALYSIS
# ===============================================================
plt.figure(figsize=(8, 5)) # Setting figure size
sns.histplot(df['quality'], bins=30, kde=True, color='blue') # Creating histogram for quality score distribution
plt.title("Distribution of Quality Score") # Adding title
plt.xlabel("Quality Score") # Labeling X-axis
plt.ylabel("Frequency") # Labeling Y-axis
plt.show() # Displaying the plot
- Right-skewed distribution with multiple peaks.
- The majority of quality scores fall between 40 and 90.
- Small bumps indicate mode clusters.
- Outliers appear at both ends.
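The skew and spread visible in the histogram can also be quantified numerically. A minimal sketch, using a synthetic right-skewed stand-in for the quality column (the real values come from Roasters.csv):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for the quality column
rng = np.random.default_rng(0)
quality = pd.Series(np.clip(40 + rng.gamma(shape=2.0, scale=8.0, size=2000), 0, 100))

print(f"skewness: {quality.skew():.2f}")             # > 0 indicates a right skew
print(quality.quantile([0.05, 0.5, 0.95]).round(1))  # bulk of the scores and tail behaviour
```

A positive skewness confirms the long right tail seen in the plot, and the quantiles pin down where the bulk of the scores lies.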
#================================================================================================
# Step 9: Plotting and displaying boxplots for raw material height and humidity to detect outliers
# ================================================================================================
fig, ax = plt.subplots(1, 2, figsize=(14, 5)) # Creating subplots for boxplots
sns.boxplot(y=df['H_data'], ax=ax[0], color='green') # Creating boxplot for raw material height
ax[0].set_title("Boxplot of H_data (Raw Material Height)") # Adding title
sns.boxplot(y=df['AH_data'], ax=ax[1], color='orange') # Creating boxplot for humidity
ax[1].set_title("Boxplot of AH_data (Humidity)") # Adding title
plt.show() # Displaying the boxplots
# PERFORMING BIVARIATE ANALYSIS
#================================================================================================
# Step 10: Creating and displaying a scatter plot to visualize the relationship between temperature and quality
#================================================================================================
plt.figure(figsize=(12, 6)) # Setting figure size
for i in range(1, 6):
    plt.scatter(df[f'T_data_{i}_1'], df['quality'], label=f'Chamber {i} Sensor 1', alpha=0.5)  # Scatter plot for each chamber
plt.xlabel("Temperature (Sensor Readings)") # Labeling x-axis
plt.ylabel("Quality") # Labeling y-axis
plt.title("Scatter Plot: Temperature vs Quality") # Adding title
plt.legend() # Displaying legend
plt.show() # Displaying the scatter plot
- Distinct temperature clusters for each chamber.
- Wide spread of quality scores at each temperature level.
- Possible non-linear relationships.
- Chamber 5 (purple) appears to have a higher spread of quality.
- Potential anomalies at high temperatures.
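Pearson correlation can miss a non-monotonic temperature-quality relationship (e.g., quality peaking at a mid-range roast temperature), while mutual information still detects it. A sketch on synthetic data with an assumed quadratic relationship, purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in: quality peaks at a mid-range temperature,
# so the dependence is strong but non-monotonic.
rng = np.random.default_rng(1)
temp = rng.uniform(150, 350, 1000)
quality = 90 - 0.005 * (temp - 250) ** 2 + rng.normal(0, 5, 1000)

r, _ = pearsonr(temp, quality)                                        # near zero: misses the pattern
mi = mutual_info_regression(temp.reshape(-1, 1), quality, random_state=0)[0]  # clearly positive
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f}")
```

This is one way to check whether the weak linear correlations observed in the heatmap hide a stronger non-linear dependence.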
#================================================================================================
# Step 11: Performing feature engineering by computing the mean temperature per chamber to create new insights
#================================================================================================
for i in range(1, 6):
    df[f'T_mean_{i}'] = df[[f'T_data_{i}_1', f'T_data_{i}_2', f'T_data_{i}_3']].mean(axis=1)  # Mean temperature per chamber
# Step 12: Displaying the newly created feature columns with mean temperatures per chamber for validation
print("\nNew Feature Columns (Mean Temperature per Chamber):")
print(df[[f'T_mean_{i}' for i in range(1, 6)]].head()) # Showing first few rows of the new feature columns
- Temperature gradation across chambers.
- Consistency across rows.
- Potential for feature reduction.
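The feature-reduction idea above can be sketched end to end: replace the 15 raw sensor columns with the 5 chamber means. The data here are synthetic stand-ins for the sensor columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 15 raw sensor columns
rng = np.random.default_rng(7)
cols = [f'T_data_{i}_{j}' for i in range(1, 6) for j in range(1, 4)]
df_demo = pd.DataFrame(rng.normal(250, 10, (50, 15)), columns=cols)

# Replace each chamber's three sensors with their mean
for i in range(1, 6):
    df_demo[f'T_mean_{i}'] = df_demo[[f'T_data_{i}_{j}' for j in range(1, 4)]].mean(axis=1)
reduced = df_demo.drop(columns=cols)
print(reduced.shape)  # 15 sensor columns collapse to 5 chamber means
```

Dropping the raw columns after aggregation removes the multicollinearity noted earlier while keeping the per-chamber temperature signal.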
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from scipy.stats import zscore
# Step 1: Displaying the shape of the dataset to understand its dimensions
print("Dataset Shape:", df.shape)
# Step 2: Checking and displaying the data types of each column to ensure correct formatting
print("\nData Types:")
print(df.dtypes)
# Step 3: Data Pre-processing - already done during EDA, repeated here for completeness.
# Handling missing values by filling them with the column mean
df = df.fillna(df.mean(numeric_only=True))  # Ensuring no data loss due to missing values
# Step 4: Handling Outliers
# Identifying outliers using boxplots
plt.figure(figsize=(12, 5))
sns.boxplot(data=df[['H_data', 'AH_data']])
plt.title("Boxplot for Outlier Detection in H_data and AH_data")
plt.show()
# Identifying and removing extreme outliers beyond 1.5*IQR
outliers = {}
for col in ['H_data', 'AH_data']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
# Displaying removed outliers from the IQR method
for col, outlier_values in outliers.items():
    print(f"\nRemoved Outliers from {col} (IQR Method):")
    print(outlier_values.to_list())
# Applying the Z-score method for additional outlier detection.
# Z-scores are recomputed per column inside the loop so the boolean mask
# always aligns with the current (already filtered) DataFrame.
threshold = 3  # Setting threshold for outliers
z_outliers = {}
for col in ['H_data', 'AH_data']:
    col_z = np.abs(zscore(df[col]))  # Computing Z-scores for this column
    z_outliers[col] = df.loc[col_z > threshold, col]  # Identifying outliers
    df = df[col_z <= threshold]  # Removing outliers
# Displaying removed outliers from the Z-score method
for col, outlier_values in z_outliers.items():
    print(f"\nRemoved Outliers from {col} (Z-Score Method):")
    print(outlier_values.to_list())
# Step 5: Splitting Data into Train and Test Sets
X = df.drop(columns=['quality']) # Features
y = df['quality'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 6: Feature Scaling
# Applying Standardization (Z-score scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 7: Ensuring No Data Leakage
# Verifying that target variable (quality) is not used as a feature
if 'quality' in X.columns:
    print("Warning: Possible data leakage detected!")
else:
    print("No data leakage detected: Target variable is not included as a feature.")
# Step 8: Displaying pre-processed data information
print("\nProcessed Data Overview:")
print("Train Data Shape:", X_train_scaled.shape)
print("Test Data Shape:", X_test_scaled.shape)
print("First 5 rows of scaled train data:")
print(pd.DataFrame(X_train_scaled, columns=X.columns).head())
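Fitting the scaler on the training split only, as done above, is what prevents leakage; wrapping the scaler and model in a Pipeline makes that guarantee automatic, including inside cross-validation. A minimal sketch on synthetic regression data (not the roaster dataset):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the roaster features
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Inside cross_val_score, the scaler is refit on each training fold only,
# so validation-fold statistics never leak into the scaling step.
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2')
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

This mirrors the Pipeline used for the production model later in the notebook.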
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import zscore
# Step 1: Building Regression Models
models = {
"Linear Regression": LinearRegression(),
"Decision Tree": DecisionTreeRegressor(),
"Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
"Bagging Regressor": BaggingRegressor(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
"AdaBoost Regressor": AdaBoostRegressor(n_estimators=100, random_state=42)
}
# Step 2: Training and Evaluating Models
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {"MSE": mse, "R2 Score": r2}
    print(f"\n{name} Performance:")
    print(f"Mean Squared Error: {mse}")
    print(f"R2 Score: {r2}")
# Step 3: Displaying Model Performance Summary
results_df = pd.DataFrame(results).T
print("\nModel Performance Summary:")
print(results_df)
- Linear Regression performs poorly.
- Decision Tree shows significant improvement.
- Random Forest and Bagging Regressor perform best.
- Gradient Boosting performs worse than expected.
- AdaBoost performs the worst.
Reasons for Selecting These 3 Models for Hyperparameter Tuning
- Random Forest and Bagging Regressor: achieved the highest R2 score (~0.926) and lowest MSE (~19.5).
- Boosting (XGBoost stands in for the earlier boosting models): AdaBoost and Gradient Boosting underperformed (R2 ~0.158, MSE ~221.7), but boosting models often require tuning to perform well.
Expected Impact of Tuning
- Random Forest & Bagging: expected to further reduce MSE and improve R2 with optimized parameters.
Tune the chosen models with respect to the metric of interest, then check the performance of the tuned models.
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
# Selecting Top 3 Models for Hyperparameter Tuning
# Based on previous performance, we choose:
# 1. Random Forest Regressor - Best R² Score and low MSE
# 2. Bagging Regressor - Similar performance to Random Forest
# 3. XGBoost Regressor - a strong boosting candidate using the fast histogram tree method
best_models = {
"Random Forest": RandomForestRegressor(random_state=42),
"Bagging Regressor": BaggingRegressor(random_state=42),
"XGBoost": XGBRegressor(tree_method='hist', random_state=42)
}
# Defining Hyperparameter Grids
param_grids = {
"Random Forest": {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
},
"Bagging Regressor": {
'n_estimators': [50, 100, 200],
'max_samples': [0.5, 0.7, 1.0],
'max_features': [0.5, 0.7, 1.0]
},
"XGBoost": {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 10]
}
}
# Performing Randomized Search for Hyperparameter Tuning
best_params = {}
results = {}
for name, model in best_models.items():
    print(f"\nTuning {name}...")
    search = RandomizedSearchCV(model, param_distributions=param_grids[name],
                                n_iter=10, scoring='r2', cv=3, n_jobs=-1, random_state=42)
    search.fit(X_train_scaled, y_train)
    best_params[name] = search.best_params_
    print(f"Best Parameters for {name}: {search.best_params_}")
# Final Model Evaluation (refitting each model with its tuned hyperparameters)
final_results = {}
for name, model in best_models.items():
    print(f"\nTraining {name} with best parameters...")
    model.set_params(**best_params[name])  # Applying the tuned hyperparameters before fitting
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    final_results[name] = {"MSE": mse, "R2 Score": r2}
    print(f"{name} Final Evaluation - MSE: {mse}, R2 Score: {r2}")
# Displaying Final Model Performance Summary
final_results_df = pd.DataFrame(final_results).T
print("\nFinal Model Performance Summary:")
print(final_results_df)
# Feature Importance Analysis (for Tree-Based Models)
def plot_feature_importance(model, model_name):
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        features = X_train.columns
        sorted_idx = np.argsort(importance)[::-1]
        plt.figure(figsize=(10, 5))
        plt.bar(range(len(features)), importance[sorted_idx], align='center')
        plt.xticks(range(len(features)), np.array(features)[sorted_idx], rotation=90)
        plt.title(f"Feature Importance - {model_name}")
        plt.xlabel("Features")
        plt.ylabel("Importance Score")
        plt.show()
# Plot Feature Importances
for name, model in best_models.items():
    plot_feature_importance(model, name)
# Residual Analysis
plt.figure(figsize=(10, 5))
for name, model in best_models.items():
    y_pred = model.predict(X_test_scaled)
    residuals = y_test - y_pred
    sns.histplot(residuals, bins=30, kde=True, label=name)
plt.title("Residual Analysis - Distribution of Errors")
plt.xlabel("Residual Error")
plt.ylabel("Frequency")
plt.legend()
plt.show()
# Actual vs Predicted Visualization
plt.figure(figsize=(10, 5))
for name, model in best_models.items():
    y_pred = model.predict(X_test_scaled)
    plt.scatter(y_test, y_pred, alpha=0.5, label=name)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r', lw=2)
plt.title("Actual vs Predicted Values")
plt.xlabel("Actual Quality")
plt.ylabel("Predicted Quality")
plt.legend()
plt.show()
# Choosing the Best Model
best_model_name = final_results_df['R2 Score'].idxmax()
best_model = best_models[best_model_name]
print(f"\nBest Model Selected: {best_model_name}")
# Evaluating the Best Model on Test Set
y_pred_best = best_model.predict(X_test_scaled)
best_mse = mean_squared_error(y_test, y_pred_best)
best_r2 = r2_score(y_test, y_pred_best)
print(f"\nPerformance of Best Model ({best_model_name}) on Test Set:")
print(f"MSE: {best_mse}, R2 Score: {best_r2}")
# Feature Importance Analysis - reusing the plot_feature_importance function defined above
# Plot Feature Importances
plot_feature_importance(best_model, best_model_name)
# Residual Analysis for Best Model
plt.figure(figsize=(10, 5))
y_pred = best_model.predict(X_test_scaled)
residuals = y_test - y_pred
sns.histplot(residuals, bins=30, kde=True, label=best_model_name)
plt.title("Residual Analysis - Distribution of Errors")
plt.xlabel("Residual Error")
plt.ylabel("Frequency")
plt.legend()
plt.show()
# Actual vs Predicted Visualization for Best Model
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred, alpha=0.5, label=best_model_name)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r', lw=2)
plt.title(f"Actual vs Predicted Values - {best_model_name}")
plt.xlabel("Actual Quality")
plt.ylabel("Predicted Quality")
plt.legend()
plt.show()
Top features: T_data_3_1, T_data_3_2, T_mean_4, T_mean_3. AH_data (relative humidity) is the least important feature in Random Forest but was important in XGBoost.
Insight:
Temperature is the key factor in coffee quality prediction, while humidity has less impact in Random Forest but was significant in XGBoost.
Errors are centered around zero, confirming balanced predictions.
Insight:
The model is stable, making consistent predictions without major overfitting.
Some spread in lower quality ranges, meaning minor errors for lower quality coffee beans.
Insight:
Random Forest generalizes well across different quality levels.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib
# Creating a pipeline for production deployment
best_pipeline = Pipeline([
('scaler', StandardScaler()),
('regressor', RandomForestRegressor(n_estimators=100, min_samples_split=10, min_samples_leaf=2, max_depth=None, random_state=42))
])
# Training the final pipeline
print("Training final production model...")
best_pipeline.fit(X_train, y_train)
# Evaluating on the test set
y_pred_best = best_pipeline.predict(X_test)
best_mse = mean_squared_error(y_test, y_pred_best)
best_r2 = r2_score(y_test, y_pred_best)
print("\nFinal Production Model Performance:")
print(f"MSE: {best_mse}, R2 Score: {best_r2}")
# Saving the trained model for deployment
joblib.dump(best_pipeline, "coffee_quality_model.pkl")
print("\nModel saved as 'coffee_quality_model.pkl'")
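At serving time, the saved pipeline can be loaded and applied to new sensor readings in one step, since scaling happens inside it. A self-contained sketch using a tiny stand-in pipeline (the real one is trained on the roaster data above):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in pipeline and synthetic training data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] * 10 + rng.normal(size=100)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('regressor', RandomForestRegressor(n_estimators=20, random_state=42))])
pipe.fit(X_demo, y_demo)
joblib.dump(pipe, 'coffee_quality_model.pkl')

# At serving time: load once, then score incoming readings directly
loaded = joblib.load('coffee_quality_model.pkl')
preds = loaded.predict(X_demo[:3])
print(preds.shape)  # one quality prediction per input row
```

Because the scaler lives inside the pipeline, callers never need to remember to scale inputs before predicting.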
# Feature Importance Analysis
if hasattr(best_pipeline.named_steps['regressor'], 'feature_importances_'):
    importance = best_pipeline.named_steps['regressor'].feature_importances_
    features = X_train.columns
    sorted_idx = np.argsort(importance)[::-1]
    plt.figure(figsize=(10, 5))
    plt.bar(range(len(features)), importance[sorted_idx], align='center')
    plt.xticks(range(len(features)), np.array(features)[sorted_idx], rotation=90)
    plt.title("Feature Importance - Production Model")
    plt.xlabel("Features")
    plt.ylabel("Importance Score")
    plt.show()
# Residual Analysis
plt.figure(figsize=(10, 5))
residuals = y_test - y_pred_best
sns.histplot(residuals, bins=30, kde=True, label="Production Model")
plt.title("Residual Analysis - Production Model")
plt.xlabel("Residual Error")
plt.ylabel("Frequency")
plt.legend()
plt.show()
# Actual vs Predicted Visualization
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred_best, alpha=0.5, label="Production Model")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r', lw=2)
plt.title("Actual vs Predicted Values - Production Model")
plt.xlabel("Actual Quality")
plt.ylabel("Predicted Quality")
plt.legend()
plt.show()
1. Model Performance (Final Test Results)
MSE: 31.41 (low error, indicating good predictions); R2 Score: 0.88 (strong predictive capability)
Insight:
The Random Forest Model generalizes well, making it suitable for real-world deployment.
2. Feature Importance (Production Model)
Top Features: T_data_3_1, T_data_3_2, T_mean_4, T_mean_3
Insight:
This model could be optimized further by focusing on the most relevant temperature sensors.
3. Residual Analysis
Most errors center around 0, meaning the model has balanced predictions.
Insight:
No major underfitting or overfitting, confirming stable production performance.
4. Actual vs Predicted Values
Some spread at lower quality values, meaning slight underestimation of lower quality beans.
Insight:
The model performs well across all quality levels, with minor variations in low scores.
1. Temperature is the Primary Factor Influencing Coffee Quality
2. Humidity Has a Limited but Notable Impact
3. Model is Reliable for Pricing Strategies
4. Model is Stable and Ready for Production
5. Future Enhancements for Continuous Improvement
Integrate Real-Time Sensor Data
Fine-Tune the Model Using New Features (e.g., interaction terms such as T_mean * AH_data)
Deploy Model for Live Quality Monitoring
The model provides a highly reliable, data-driven approach to coffee quality assessment, enabling automated pricing and operational efficiencies.